PROSPER DATA EXPLORATION by JUNAID

I chose Prosper data set for exploration since I was eager to learn more about the loans and their relationship with the borrowers, especially factors that indicate successful repayment of borrowed amount in future. I also wanted to gain experience by working with datasets from banking industry.

The dataset is huge and contains a lot of variables. It is not possible to analyse all the variables in this project. Variables visible in a listing profile are preferred and variables related to them are selected. Totally there are 16 variables in the subsetted dataframe. An example of a listing can be found in this link https://www.prosper.com/help/topics/how-to-read-a-loan-listing/.

Lets have a look at the dataframe.

##  [1] "Term"                      "LoanStatus"               
##  [3] "BorrowerAPR"               "ProsperRating..numeric."  
##  [5] "ProsperRating..Alpha."     "ListingCategory..numeric."
##  [7] "Occupation"                "IsBorrowerHomeowner"      
##  [9] "CreditScoreRangeLower"     "CreditScoreRangeUpper"    
## [11] "DelinquenciesLast7Years"   "BankcardUtilization"      
## [13] "DebtToIncomeRatio"         "StatedMonthlyIncome"      
## [15] "LoanOriginalAmount"        "ListingCategory"
## 'data.frame':    113937 obs. of  16 variables:
##  $ Term                     : Ord.factor w/ 3 levels "12"<"36"<"60": 2 2 2 2 2 3 2 2 2 2 ...
##  $ LoanStatus               : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ BorrowerAPR              : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ ProsperRating..numeric.  : Ord.factor w/ 8 levels "0"<"1"<"2"<"3"<..: 1 7 1 7 4 6 3 5 8 8 ...
##  $ ProsperRating..Alpha.    : Ord.factor w/ 7 levels "HR"<"E"<"D"<"C"<..: NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ListingCategory..numeric.: int  0 2 0 16 2 1 1 2 7 7 ...
##  $ Occupation               : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ IsBorrowerHomeowner      : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CreditScoreRangeLower    : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper    : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ DelinquenciesLast7Years  : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ BankcardUtilization      : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ DebtToIncomeRatio        : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ StatedMonthlyIncome      : num  3083 6125 2083 2875 9583 ...
##  $ LoanOriginalAmount       : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ ListingCategory          : Factor w/ 21 levels "Not Available",..: 1 3 1 17 3 2 2 3 8 8 ...
##  Term                       LoanStatus     BorrowerAPR     
##  12: 1614   Current              :56576   Min.   :0.00653  
##  36:87778   Completed            :38074   1st Qu.:0.15629  
##  60:24545   Chargedoff           :11992   Median :0.20976  
##             Defaulted            : 5018   Mean   :0.21883  
##             Past Due (1-15 days) :  806   3rd Qu.:0.28381  
##             Past Due (31-60 days):  363   Max.   :0.51229  
##             (Other)              : 1108   NA's   :25       
##  ProsperRating..numeric. ProsperRating..Alpha. ListingCategory..numeric.
##  0      :29084           C      :18345         Min.   : 0.000           
##  4      :18345           B      :15581         1st Qu.: 1.000           
##  5      :15581           A      :14551         Median : 1.000           
##  6      :14551           D      :14274         Mean   : 2.774           
##  3      :14274           E      : 9795         3rd Qu.: 3.000           
##  2      : 9795           (Other):12307         Max.   :20.000           
##  (Other):12307           NA's   :29084                                  
##                     Occupation    IsBorrowerHomeowner
##  Other                   :28617   False:56459        
##  Professional            :13628   True :57478        
##  Computer Programmer     : 4478                      
##  Executive               : 4311                      
##  Teacher                 : 3759                      
##  Administrative Assistant: 3688                      
##  (Other)                 :55456                      
##  CreditScoreRangeLower CreditScoreRangeUpper DelinquenciesLast7Years
##  Min.   :  0.0         Min.   : 19.0         Min.   : 0.000         
##  1st Qu.:660.0         1st Qu.:679.0         1st Qu.: 0.000         
##  Median :680.0         Median :699.0         Median : 0.000         
##  Mean   :685.6         Mean   :704.6         Mean   : 4.155         
##  3rd Qu.:720.0         3rd Qu.:739.0         3rd Qu.: 3.000         
##  Max.   :880.0         Max.   :899.0         Max.   :99.000         
##  NA's   :591           NA's   :591           NA's   :990            
##  BankcardUtilization DebtToIncomeRatio StatedMonthlyIncome
##  Min.   :0.000       Min.   : 0.000    Min.   :      0    
##  1st Qu.:0.310       1st Qu.: 0.140    1st Qu.:   3200    
##  Median :0.600       Median : 0.220    Median :   4667    
##  Mean   :0.561       Mean   : 0.276    Mean   :   5608    
##  3rd Qu.:0.840       3rd Qu.: 0.320    3rd Qu.:   6825    
##  Max.   :5.950       Max.   :10.010    Max.   :1750003    
##  NA's   :7604        NA's   :8554                         
##  LoanOriginalAmount           ListingCategory 
##  Min.   : 1000      Debt Consolidation:58308  
##  1st Qu.: 4000      Not Available     :16965  
##  Median : 6500      Other             :10494  
##  Mean   : 8337      Home Improvement  : 7433  
##  3rd Qu.:12000      Business          : 7189  
##  Max.   :35000      Auto              : 2572  
##                     (Other)           :10976

We see that the dataframe contains 113937 rows. The variable for ListingCategory (numeric) is a factor. Its levels are renamed and stored in another variable called ListingCategory. Thus, the variables are increased from the selected 15 to 16.

Univariate Plots Section

Data Overview

Loan Term

There are a lot of loans with a loan term of 36 months (3 years). 60 months and 12 months loans are less in number. This means that either the borrowers or the investors prefer a 3 year period compared to the other two options. Since we don’t have the applicant data, we cannot conclude who likely prefers 36 months.

Loan Status

In this bar chart, it is clear that most loan data is for current loans followed by successfully completed loans and then chargedoff loans. Past due levels consist the days delayed information. They are grouped into one varible called Past_due.

Loan Amount

The histogram tells that the most common amount is 4000 followed by 10000 and 15000.

Loan purpose

## 
##      Not Available Debt Consolidation   Home Improvement 
##              16965              58308               7433 
##           Business      Personal Loan        Student Use 
##               7189               2395                756 
##               Auto              Other      Baby&Adoption 
##               2572              10494                199 
##               Boat Cosmetic Procedure    Engagement Ring 
##                 85                 91                217 
##        Green Loans Household Expenses    Large Purchases 
##                 59               1996                876 
##     Medical/Dental         Motorcycle                 RV 
##               1522                304                 52 
##              Taxes           Vacation      Wedding Loans 
##                885                768                771

It is seen that many people use Prosper loans for Debt consolidation than any other purpose. From the table of Loan category, the number of debt considation loans is found to be 58308. Now, lets look at the percentage at which loan amount is borrowed.

Borrower Annual Percentage Rate

We see that the borrower Annual Percentage Rate follows a normal distribution with an outlier at BorrowerAPR 0.36. This pattern is similiar for BorrowerRate as well since it is derived from BorrowerAPR.

Borrower Income

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750000

The income of the borrower is plotted after removing the top 1% values. We can see that the distribution is skewed to the right with mode around $4500. Therefore, we can say that most of the borrowers earn in that range.

Again removing the top 1% values from debt to income ratio, we see that the distribution is slightly skewed to the right. We remove the top 1% values, to filter the outliers. From the definition, we know that the maximum value for Debt to income ratio is 10.01. Thus from the plot, we can infer that 99% of the borrowers in the dataset have ratio less than 1, which means that their debts value does not exceed their income.

Bankcard utilization

It is a distribution with two peaks, one at zero and the other at one. It means that some people do not have a bankcard or do not use them, while some people use them completely to their limit. There are also certain people, who have exceeded their limit on the card.

Number of rows with BankcardUtilization less than 0.05

## [1] 9361

Number of rows with BankcardUtilization less than 0.05

## [1] 10362

Occupation of the borrower

There are variety of occupations present in the dataset. This is also evident from the fact that ‘other’ category has the the highest count. Nothing much can be concluded from this information.

Delinquecies in last 7 years

Number of borrowers with zero delinquencies

## [1] 76439

It is seen that most borrowers have no delinquencies in the past 7 years. To be exact, 76439 borrowers did not have any delinquencies.

Prosper rating of the borrowers

From the ratings available in the dataset, we see that there are a lot borrowers with C rating. AA rated borrowers are the least of all categories.

Credit Score

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    19.0   679.0   699.0   704.6   739.0   899.0     591

We can see that there are discrete bars in the histogram since credit scores are integer values. It is a normal distribution with median score of 699. Upper range of credit scores was used for this plot. The lower range must also exhibit similiar behaviour.

Univariate Analysis

What is the structure of your dataset?

There are 113937 unique loan data in the dataframe with 30 variables. Out of these, ProsperRating..numeric., ProsperRating..Alpha. are ordered factors with levels described below.

(worst) ——> (best)

  1. ProsperRating..numeric.: 0,1,2,3,4,5,6,7
  2. ProsperRating..Alpha.: HR, E, D, C, B, A, AA

There are many factor variables without any order such as LoanStatus, Occupation, IsBorrowerHomeowner, ListingCategory. the levels of these variables are listed below.

  1. LoanStatus: Cancelled, Chargedoff, Completed, Current, Defaulted,
    FinalPaymentInProgress, Past Due (>120 days), Past Due (1-15 days), Past Due (16-30 days), Past Due (31-60 days), Past Due (61-90 days), Past Due (91-120 days)
  2. Occupation: 68 levels like Accountant/CPA, Administrative Assistant, Analyst.
  3. IsBorowerHomeowner: TRUE, FALSE
  4. ListingCategory: 21 levels like Not Available, Debt Consolidation, Home Improvement.

Interesting Observations:
Most preferred term is 36 months
75% of the Credit Scores are under 739
75% of the loan amounts are under $12000
Very small number of Borrowers have delinquencies in the past 7 years
10.5% of all loans have been chargedoff

What is/are the main feature(s) of interest in your dataset?

I would like to estimate the relationship between the borrower information and their likelihood to default. To find this the LoanStatus and the various ratings might play a major role in the analysis. This analysis will then be useful for investors to avoid risky borrowers

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Other variables such as debt to income ratio, Bank card utilization ratio, delinquencies in the past years must also affect the outcome.

Did you create any new variables from existing variables in the dataset?

The ListingCategory (numeric) was a factor with integer levels. I converted them to appropriate cateogies such as ‘Home loan’ and assigned it to a new factor ListingCategory.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

In the Loan status histogram , there were a lot of ‘past due’ levels for various distributions. It made the plot difficult to understand. Hence the past due levels were grouped together to form a single level. It then gave a clear picture of the loan status.

Bivariate Plots Section

Let us examine the relationship between loan status and monthly income

It is interesting to note that the the Loan status of ‘Cancelled’, ‘Chargedoff’ or ‘Defaulted’ have low median values of monthly income compared to other categories of loan statuses. It is one of the factors that might be affecting borrowers unable to payback their loan.

The statistics of the three categories: Chargedoff, Defaulted and Cancelled is listed below.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2500    3750    4486    5500  208300
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2500    3708    4367    5417   58620
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2445    2600    2609    3833    4167

Now that we know low income might be correlated with defaulting. Let us look if the annual percentage rate affects as well.

The median rate for the loans seems to be less for completed loans than those not completed. Categories such as Chargedoff, cancelled, Pastdue and Defaulted have higher median APR than those completed. This implies that the rate could be a burden on the borrower. From the scatter plot it is also evident that bulk of the data for Chargedoff and Defaulted categories has APR between 0.3 and 0.4. It is represented by the darker lines in that region.

There is no clear conclusion from this graph. Borrowers of all Prosper ratings are present in all categories of loan statuses. After all, prosper ratings seem to be not so effective in categorising the risky borrowers from non risky ones.

We expect credit score to be a good representative of identifying risky borrowers.

From this graph, it is seen that the credit scores of defaulters are lower than those who have completed the loans. It seems to be good predictor as well. We took upper limit of the credit score for this graph. Lets check whether this holds true for the lower limit as well.

Indeed yes. Credit scores seem to be correlated with borrower defaulting.

Now lelt us look if debts of the borrower relative to income affect the loan outcome.

There doesn’t seem to be a relationship between the debt to income ratio and defaulting. Borrowers who have defaulted do not have huge debt ratios to make payments impossible. The ratio distribution in all groups seems random and there is no correlation. This is a dead end.

Here as well, we see no noticable difference between the loan amounts and the status of the loan.

Most commom loan term is 36 months. It looks like all loans of 12 month peroid have been completed. However, there is no relationship of the other two terms with loan completion. Thus, loan term of 12 months looks less risky.

The histogram displays bank card utilization ratio and the corresponding loan status. We can see that chargedoff loans are spread across all bank card utilization values. There doesn’t seem to be a relationship between the two variables.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

In this Bivariate analysis, LoanStatus was tested for correlation with variables StatedMonthlyIncome, BorrowerAPR, ProsperRating, CreditScoreRangeUpper, CreditScoreRangeLower, DebtToIncomeRatio and LoanOriginalAmount. The LoanStatus, i.e. the likelihood of defaulting, or loan being Chargedoff was found to be correlated with the monthly income of the borrower. The median values of defaulters are less compared to those who have successfully completed their loan. Since LoanStatus is a categorical variable, the correlation coefficient with the monthly income cannot be measured.

Loan status was also correlated with the Annual Percentage rate charged on the loan from the borrower. A boxplot was used to check for relationship and it was found that higher median APR (greater than 0.25) corresponded with default and Chargedoff loans. Completed loans had a median APR less than 0.2.

Credit scores were also a good method to predict the risk involved with a borrower. However, the correlation was weak. There was only a slight difference in the credit scores for borrowers whose loans were chargedoff and for those who successfully cloased it.

In terms of loan period, it was seen that all 12 month loans were completed. 36 months was the most common loan term. But there was no relationship of 36 and 60 months term with completion. Hence borrowers with a small term loan were less risky to invest.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Both upper and lower limit of credit scores gave similiar results. They perfectly correlated with each other.

Another surprise was that Prosper ratings given to the borrowers had no relationship whn compared with the resulting loan statuses (those completed and chargedoff). Thus the Prosper rating system does not seem to work as intended.

What was the strongest relationship you found?

Even though the value could not be quantified, there was strong relationship between monthly income of borrowers and the final loan status. Hence income could be used to predict risk of a borrower along with factors such as credit scores.

Multivariate Plots Section

If we carefully observe the graph, we an see that there are a lot of chargedoff loans for high debt to income ratio and low Credit scores. The area of the graph above credit score of above 650 and debt income ratio less than 2.5 has a lot of completed loans than the rest of the graph.

Lets look at the summaries of each facet from HR to AA

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2026  0.3564  0.3580  0.3571  0.3580  0.4240
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1743  0.3355  0.3513  0.3450  0.3564  0.4068
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1406  0.2834  0.2939  0.2917  0.3053  0.3915
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1116  0.2236  0.2390  0.2382  0.2526  0.3915
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.08999 0.17220 0.18410 0.18720 0.20200 0.37450
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07045 0.11770 0.12780 0.13370 0.14660 0.24810
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04583 0.07371 0.08341 0.08649 0.09643 0.21370

We see that the median values of Borrower rate vary according to the prosper rating of the borrower. Borrowers with good rating get lower percentages, whereas those with lower ratings get higher percentages. For any single rating, we can also see that the median rate for those completed is slightly lesser that those who didn’t.

After taking ratio of creditScore to Debtincome ratio and dividing it by 900, the varible was plotted against borrower rate. The borrower rate was transformed to log scale. From this scatter plot, we can observe that bulk of chargedoff loans are present only after a rate of 0.10 and they increase as the borrower rate increases. These charged off loans also seem to be concentrated near the lower values of y axis, i.e for lower credit score and higher debt to income ratio.

Now lets look at the effect of term.

Here, we can see that 12 month loans have very little chargeoff and their borrowing rate is above 10%. 36 months is the most common loan term, here almost all loans with rate less than 10% have been completed. It looks like 60 month term has high chance of a loan being chargedoff, since around 2/3 rd of the loans are not completed. The borrowing percentage is also really high. It may be prosper policy to charge higher rates for long loan periods.

Lets check if homeownership affects final outcome.

There is no big relationship found here. One small thing that can noticed is that there are more Chargedoff loans when the borrower is not a homeowner.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I found that high DebtToIncomeRatio and lower credit scores meant chargedoff loans. Only when both the variables were at the desired levels, the completion rate was high. The desired levels were high credit scores and low DebtToIncomeRatio. Another major finding was that the completion of 12 month loans was very high and low borrower rate corresponded with high completion

Were there any interesting or surprising interactions between features?

In the bar graph of LoanStatus faceted by loan Terms. I was surprised to see that 60 month term loans had very high incompletion (almost 2/3rd). This was not true for the other two terms of 12 month and 36 month.


Final Plots and Summary

Plot One

Description One

This plot describes that completed loans have the lowest median monthly income compared with Chargedoff and Defaulted loans. It seems logical that people who have well paying jobs are better positioned to payback the loan amount than people who don’t have a job. It should be noted that the graph is in terms of monthly income, the difference will be much more pronounced for annual income.

Plot Two

Description Two

This is a graph that describes the relationship of four variables with loan completion. Here the borrower rate is represented in a log scale. It is found that loans chargedoff and defaulted are concentrated above borrower rate of 10% and for a low value of the ratio between credit score and debt to income ratio.

Plot Three

Description Three

This graph summarises the findings from this exploratory data analysis. These three variables had the most impact on whether a loan was chargedoff or completed. In this graph, the mean values of each variable grouped by Loan status is plotted. For Borrower rate, low value is preferred. For Monthly income and credit score, higher values are preferred in the range of $5400 and 710 respectively.


Reflection

Start of the project was the most diffcult phase. It was challenging to understand the dataset and the variables. Since there are 81 variables in this dataset, I had to limit them, so that the selected variables could be analysed within the scope of this project. It was a tough task. I decided to select variables that were important to answer my question of predicting defaulting using borrower information. I alss used the listing example on Prosper website to narrow down variables. In the univariate analysis, I plotted various bars and histograms to understand the variables chosen. Then from those, I proceded to indentify relationship between variables in Bivariate analysis. Here my main focus was to find relationship between loan status and other selected variables. I found that Credit score, monthly income, borrower annual percentage rate and term had an effect on the final outcome of the loan. In multivariate analysis, I explored the relationship by combining varibles identified from previous section. This provided good answers about borrowers and their likelihood to default.

I hope my project will help investors in classify borrowers according to their risk and help them make informed decision. However, my project is by no way a complete analysis of the dataset. There is a lot of scope to improve this by analysing the variables not selected in this project. For.eg. The impact of variables such as LenderYield, EstimatedLoss, etc can have an impact on the final outcome. LenderYield and EstimatedLoss determine the amount returned to the investor and that may impact his decision to invest in a borrower. Moreover, groupmembership, recommendations of the borrower might also affect investment decision on top of the variables discussed in this project.

I feel that developing a model to predict would be the next step for this project. This model would predict a score based on all the variables that are found to affect the loan outcome. Then a threshold value of the score developed could be set to determine investment decisions and avoid risky borrowers. It could be done by multiple linear regression.